A Comparison between Decision Tree and Random Forest in Determining the Risk Factors Associated with Type 2 Diabetes

Background: We aimed to identify the risk factors associated with type 2 diabetes mellitus (T2DM) using two data mining techniques, decision tree and random forest, applied to data from the Mashhad Stroke and Heart Atherosclerotic Disorders (MASHAD) Study. Study design: A cross-sectional study. Methods: The MASHAD study started in 2010 and will continue until 2020. Two data mining tools, decision trees and random forests, were used to predict T2DM from other characteristics observed on 9528 subjects recruited from the MASHAD database. The two models were compared in terms of accuracy, sensitivity, specificity, and the area under the ROC curve. Results: The prevalence of T2DM was 14% among these subjects. The decision tree model had 64.9% accuracy, 64.5% sensitivity, 66.8% specificity, and an area under the ROC curve of 68.6%, while the random forest model had 71.1% accuracy, 71.3% sensitivity, 69.9% specificity, and an area under the ROC curve of 77.3%. Conclusions: The random forest model, when used with demographic, clinical, anthropometric, and biochemical measurements, can provide a simple tool to identify risk factors associated with type 2 diabetes. Such identification can be of substantial use in managing health policy to reduce the number of subjects with T2DM.

Introduction
Type 2 diabetes mellitus (T2DM) is a major public health problem and its mortality is increasing worldwide 1,2. WHO predicts the prevalence of T2DM in Iran to be 6.8% in 2025, which translates to 5,215,000 citizens of Iran 3.
The Tehran cohort reports a T2DM prevalence of 11% in Iran 4, and the Mashhad cohort reports a prevalence of 14% 5.
T2DM is one of the most serious challenges for developing countries in the 21st century 6,7. Diabetes has its roots in interactions between genetic, environmental, and behavioral characteristics 8,9. Cardiovascular diseases in particular are responsible for 80% of deaths due to T2DM 10. The dominant possible risk factors in the development of T2DM are ethnicity, obesity, unhealthy diet, lack of physical activity, insulin resistance, and family history of diabetes 11. Heart disease, stroke, blindness, kidney disease, and amputations are associated with diabetes 12. It is therefore essential to identify and diagnose individuals at high risk of T2DM 6,13.
In recent decades, researchers in Iran have used data mining methods such as decision trees, neural networks, support vector machines, and random forests to predict the risk factors associated with T2DM 5,14. One reason for not using classical statistical methods is the large number of predictors, from which classical methods cannot conveniently select. Decision tree and random forest are both classification models, and few studies have compared them in this regard.
Data mining is a new collection of statistical methods used to identify characteristics significantly associated with T2DM 15,16. Data mining can discover new factors and find relationships among factors that reveal patterns and support predictions based on new factors associated with T2DM 17,18. To date, few studies have examined the risk factors associated with T2DM using data mining algorithms in Eastern Iran. In this study, we developed a predictive model to identify risk factors associated with T2DM as a supplement to screening and public health in Eastern Iran.

Participants
The MASHAD study started in 2010 and will continue until 2020. The city of Mashhad is located in the north-eastern part of Iran. The total population of Mashhad was estimated using the national Iranian census of 2006, and the sample size was determined accordingly. Participants were enrolled from three regions of Mashhad. Each region was divided into nine sites centered at Mashhad Healthcare Center divisions. Overall, 9528 subjects were enrolled as part of the MASHAD study 19.
This protocol was approved by the Ethics Committee of MUMS, and written informed consent was obtained from every participant.
Demographic characteristics such as age, gender, marital status, education, cigarette smoking habit, physical activity level (PAL), family history of diabetes (FHD), and depression score were collected from all subjects. The Beck Depression Inventory-II (BDI-II) was used to evaluate depression. Anthropometric information including weight, height, and waist and hip circumference was obtained. Systolic and diastolic blood pressures were measured as described earlier 19. Biochemical parameters, including fasted serum triglycerides (TGs), total cholesterol (TC), HDL-cholesterol, LDL-cholesterol, fasting blood glucose (FBG), and hs-CRP, were measured as previously described 19. Diagnosed T2DM was identified based on fasting blood glucose (FBG) ≥126 mg/dl 20.

Input variables
The final data set contains 9528 records and 18 variables, divided into 17 predictor variables and one outcome or target variable. The target variable has two possible states, namely occurrence of T2DM or no occurrence of T2DM. Age, gender, body mass index (BMI), marital status, level of education, biochemical markers, physical activity level (PAL), cigarette smoking habits, family history of diabetes (FHD), and depression score were considered as predictors (Tables 1-2).

Decision tree model
A decision tree is a non-parametric method named according to the nature of the target variable. It is called a classification tree if the target variable is categorical and a regression tree if the target variable is continuous. The purpose of a decision tree is to develop a predictive model in terms of predictor variables. The tree is formed by successively dividing the data according to one of the predictor variables. A decision tree consists of three types of nodes: root node, internal nodes, and leaf nodes [21][22][23]. Decision tree algorithms develop splitting criteria at internal nodes to form the tree. The split of a node attempts to minimize the impurity of the node. If no split is able to reduce impurity, the node is not split and is declared a leaf node. Otherwise, the split providing the maximum reduction in impurity is selected and two branches are formed, creating two new nodes. The popular splitting criteria are information gain, Gini index, and gain ratio. CART is one of the decision tree algorithms that construct a binary tree using the Gini index for selecting the splitting variable at every internal node. The Gini index at a node D is given by
$$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{m} p_i^2$$
where p_i is the probability that an observation in D belongs to the class C_i and is estimated by |C_i,D|/|D| 24,25. The sum is taken over the m possible classes. The tree begins with all observations forming the root node, and successive splits determine the order of importance of the predictor variables.
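The Gini criterion can be illustrated with a minimal sketch. The triglyceride values and outcomes below are hypothetical toy data, not MASHAD records, and the function names (`gini`, `split_gini`, `best_threshold`) are illustrative choices rather than part of any package used in the study. The sketch only searches thresholds on a single numeric predictor, whereas CART repeats this search over all predictors at every node.

```python
from collections import Counter


def gini(labels):
    """Gini index of a node: 1 - sum_i p_i^2, with p_i = |C_i,D| / |D|."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(values, labels, threshold):
    """Size-weighted Gini index of the two child nodes of a binary split."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Midpoint between adjacent distinct values with the lowest weighted Gini.

    Returns None when the predictor takes a single value (no split possible).
    """
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    if not candidates:
        return None
    return min(candidates, key=lambda t: split_gini(values, labels, t))


# Hypothetical toy data: triglyceride values and T2DM status (1 = T2DM).
tg = [120, 150, 170, 190, 210, 230]
t2dm = [0, 0, 0, 1, 1, 1]

print(gini(t2dm))                # impurity of the unsplit node
print(best_threshold(tg, t2dm))  # threshold selected by the Gini criterion
```

With two equally frequent classes the unsplit node has the maximum two-class impurity of 0.5, and a threshold that separates the classes perfectly drives the weighted impurity of the children to zero.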

Random Forest
Random forest is an ensemble learning method. It generates many classification trees by randomly selecting subsets of the given dataset and subsets of the predictor variables, and then aggregates the results of all trees to obtain the random forest. Multiple classification trees are obtained from bootstrap samples in order to arrive at the final "majority" classification rules. The tree training parameters used in the present study are (i) ntree=500, the number of trees generated, (ii) mtry=17, the number of predictor variables sampled as candidates at each split, and (iii) nodesize=5, the minimum number of observations in a leaf node. As with other supervised machine learning algorithms, the data are divided into two parts, namely training data and test data.
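The two sources of randomness and the majority vote can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of the randomForest package: a one-split "stump" stands in for a full classification tree, the predictor subset is drawn once per tree, and the data are hypothetical. The argument names `ntree` and `mtry` are borrowed from the randomForest R package; everything else (`fit_stump`, `fit_forest`, `predict`) is an illustrative name.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature_ids):
    """Pick the (feature, threshold) pair with the fewest training errors.

    The stump predicts class 1 when row[feature] > threshold, else class 0.
    """
    best = None
    for j in feature_ids:
        for t in sorted({r[j] for r in rows}):
            errors = sum((1 if r[j] > t else 0) != y for r, y in zip(rows, labels))
            if best is None or errors < best[0]:
                best = (errors, j, t)
    _, j, t = best
    return j, t


def fit_forest(rows, labels, ntree=25, mtry=1, seed=0):
    """Fit ntree stumps, each on a bootstrap sample and a random predictor subset."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    forest = []
    for _ in range(ntree):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: n draws with replacement
        feats = rng.sample(range(p), mtry)          # random subset of predictors
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Majority vote over the individual stump predictions."""
    votes = Counter(1 if row[j] > t else 0 for j, t in forest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (triglycerides, hs-CRP) and T2DM status (1 = T2DM).
rows = [(100, 0.5), (120, 0.8), (130, 1.0), (220, 4.0), (240, 5.0), (260, 6.0)]
labels = [0, 0, 0, 1, 1, 1]

forest = fit_forest(rows, labels)
print(predict(forest, (300, 7.0)), predict(forest, (90, 0.4)))
```

Setting `mtry` equal to the total number of predictors removes the random predictor selection, so the trees then differ only through their bootstrap samples.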
One of the most important features of random forest and decision tree is the output of the variable importance. Variable importance measures the degree of association between a given variable and the classification result. Random forest and decision tree have four measures for the variable importance: raw importance score for class 0, raw importance score for class 1, decrease in accuracy and the Gini index 26 .
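The "decrease in accuracy" measure can be sketched by shuffling one predictor at a time and recording how much the accuracy drops. This is a simplified illustration: the data and the fixed threshold model are hypothetical, the function names are illustrative, and random forest implementations typically compute this measure on out-of-bag observations rather than on a single fixed data set as done here.

```python
import random


def accuracy(model, rows, labels):
    """Proportion of rows whose predicted class equals the observed class."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, feature, n_repeats=20, seed=0):
    """Mean drop in accuracy after randomly shuffling one predictor column."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_repeats):
        column = [r[feature] for r in rows]
        rng.shuffle(column)
        # Rebuild each row (a tuple) with the shuffled value in place.
        shuffled = [r[:feature] + (v,) + r[feature + 1:] for r, v in zip(rows, column)]
        drops.append(baseline - accuracy(model, shuffled, labels))
    return sum(drops) / n_repeats


# Hypothetical toy data: (triglycerides, unrelated predictor) and T2DM status.
rows = [(120, 1.0), (150, 3.0), (170, 2.0), (190, 1.5), (210, 2.5), (230, 0.5)]
labels = [0, 0, 0, 1, 1, 1]


def model(row):
    """Hypothetical fitted classifier that uses only the first predictor."""
    return 1 if row[0] > 180 else 0


print(permutation_importance(model, rows, labels, feature=0))
print(permutation_importance(model, rows, labels, feature=1))
```

A predictor the model never uses has an importance of exactly zero, because shuffling it cannot change any prediction.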
Statistical analyses were performed using the R packages rpart (for decision trees), randomForest (for random forest), and caret. The complete sample contained 1361 individuals with T2DM and 8167 individuals without T2DM. The present study adopted a 10-fold cross-validation method to evaluate the decision tree and random forest models. The 10-fold cross-validation method involves randomly separating the acquired data into 10 subsets of equal sample size. In each round, the decision tree and random forest models are constructed on a training data set formed from nine of the subsets, and the remaining subset is used as test data for verifying model effectiveness. Ten repeated empirical tests were conducted, so that each subset was used once as the test data. The bootstrap (500 replications) optimism-corrected area under the receiver operating characteristic (ROC) curve was estimated using R software.
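The partitioning step of 10-fold cross-validation can be sketched as below. This is a minimal illustration of the fold logic only, not the caret implementation; `fit` and `score` are hypothetical callbacks supplied by the caller, and the function names are illustrative.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(n, fit, score, k=10, seed=0):
    """Use each fold once as test data and the other k-1 folds as training data."""
    folds = k_fold_indices(n, k, seed)
    results = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        model = fit(train_idx)                  # hypothetical: fit on training rows
        results.append(score(model, test_idx))  # hypothetical: evaluate on test rows
    return results


folds = k_fold_indices(9528)  # sample size reported in the study
print([len(f) for f in folds])
```

Because 9528 is not divisible by 10, the folds cannot be exactly equal; in this sketch their sizes differ by at most one observation.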
The decision tree developed on the training data was used to obtain the splitting criteria for the different nodes and was then applied to the observations in the test data. The resulting tree is used to measure the sensitivity, specificity, and accuracy of the model. If the values of these measures are high for the training data and lower for the test data, this is considered a case of overfitting. These measures must therefore be obtained on training data as well as on test data in order to establish the validity of the model. The models reported in this paper have been validated, and results on test data are reported here.
Models are evaluated by constructing the confusion matrix for test data. In addition, accuracy, sensitivity, and specificity are measured for each model. With TP, TN, FP, and FN denoting the numbers of true positives, true negatives, false positives, and false negatives, these measures are defined as follows 27:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}$$
The receiver operating characteristic (ROC) curve is the plot that displays the full picture of the trade-off between sensitivity and (1-specificity) across a series of cutoff points. The area under the ROC curve is considered an effective measure of the inherent validity of a diagnostic test.
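Accuracy, sensitivity, and specificity follow directly from the four cells of the confusion matrix. The sketch below uses their standard definitions on hypothetical observed and predicted labels; the function names are illustrative.

```python
def confusion_matrix(actual, predicted, positive=1):
    """Return (TP, TN, FP, FN) counts for a binary classification."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity (true positive rate), specificity (true negative rate)."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical test-set labels (1 = T2DM) and model predictions.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 1, 0]

counts = confusion_matrix(actual, predicted)
print(counts)
print(metrics(*counts))
```

When one class is much rarer than the other, accuracy alone can look high for a model that rarely predicts the rare class, which is one reason to report sensitivity and specificity alongside it.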

Results
Anthropometric and biochemical features are summarized in Tables 1 and 2, respectively. Overall, 1361 (14.3%) people had T2DM. Of the 1361 diabetic individuals, 843 (61.9%) were female, 1239 (91%) were married, and 783 (57.5%) were unemployed. Subjects with T2DM showed significantly higher systolic blood pressure, triglycerides, hs-CRP, diastolic blood pressure, serum total cholesterol, and LDL-cholesterol, and significantly lower HDL-cholesterol, than subjects without T2DM. The mean age of diabetic individuals was higher than that of non-diabetic individuals (52.01±7.2 vs 47.70±8.1, P<0.001). The mean BMI was 28.78±4.4 in diabetic patients and 27.76±4.7 in non-diabetic persons. The independent t-test showed that BMI was significantly higher in diabetic than in non-diabetic people (P<0.001). The mean PAL of diabetic individuals was lower than that of non-diabetic individuals (1.59±0.86 vs 1.60±0.64, P=0.040).
Based on the results of the random forest model, TG, hs-CRP, SBP, LDL, TC, FHD, age, BMI, and PAL were the most important risk factors related to T2DM (Figure 1). In a subgroup with TG>204.5 and hs-CRP≤1.32 and occupation=employment, the probability of T2DM not occurring was 79.2%. In the subgroup with TG>204.5 and hs-CRP<1.32 and occupation=unemployment and hs-CRP>4.66, the probability of T2DM occurring was 90% (Table 3). Based on the results of the decision tree model, FHD, age, TG, SBP, hs-CRP, BMI, and DBP were the most important risk factors related to T2DM. Figure 2 shows the complete tree produced by CART. The decision tree showed that in a subgroup with FHD=no and TG<184, the probability of T2DM not occurring was 92%. In another subgroup, with FHD=yes, age<48, and SBP<140, the probability of T2DM not occurring was 87% (Table 3).

Discussion
We developed a prediction model based on a cross-sectional study to identify risk factors of T2DM using decision tree and random forest models. The random forest model showed that TG, hs-CRP, SBP, LDL, TC, FHD, age, BMI, and PAL were strongly associated with T2DM. The decision tree model found that FHD, age, TG, SBP, hs-CRP, BMI, and DBP were strongly associated with the occurrence of T2DM. Putting the two results together, TG, FHD, hs-CRP, SBP, age, and BMI are the risk factors of T2DM common to the two models. In a cohort study using a decision tree, TG, family history of T2DM, BMI, SBP, education level, and occupation were the risk factors associated with T2DM 4.
The decision tree algorithm is a classification model based on different predictor variables and is widely used in medicine [28][29][30]. RF creates multiple classification and regression (CART) trees, each trained on a bootstrap sample of the original training data, and searches across a randomly selected subset of input variables to determine each split 31. Variables such as family history of diabetes, age, triglycerides, LDL-cholesterol, body mass index, and physical activity level have already been identified as important risk factors associated with diabetes [32][33][34]. The present study found hs-CRP to be an important risk factor associated with T2DM, a finding that has not been reported so far 28,33.
The results of our study showed that family history of diabetes and triglycerides were the most important risk factors related to T2DM in the decision tree and random forest models. Other studies have also found family history of diabetes and TG to be the most important risk factors associated with T2DM 4,30.
Decision trees are among the simplest tools for decision support and are easy to understand. A decision tree can easily be converted to if-then rules. Programs based on these rules can be written and run on personal computers for decision analysis, and can easily be used by physicians and health care personnel to draw conclusions about outcomes 4,35-38.
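As an illustration of the conversion to if-then rules, the sketch below encodes only the two decision tree subgroups quoted in the Results (FHD=no with TG<184, and FHD=yes with age<48 and SBP<140). The function name and argument names are hypothetical, the remaining branches of the tree are not reproduced, and the function returns None for subjects outside these two subgroups rather than guessing a value.

```python
def no_t2dm_probability(fhd, tg=None, age=None, sbp=None):
    """Reported probability that T2DM does not occur, for two quoted subgroups.

    fhd: True if the subject has a family history of diabetes.
    Returns None when the subject is outside both subgroups or when the
    measurements needed to evaluate a rule are missing.
    """
    # Rule 1: IF FHD = no AND TG < 184 THEN P(no T2DM) = 92%
    if not fhd and tg is not None and tg < 184:
        return 0.92
    # Rule 2: IF FHD = yes AND age < 48 AND SBP < 140 THEN P(no T2DM) = 87%
    if fhd and age is not None and sbp is not None and age < 48 and sbp < 140:
        return 0.87
    return None


print(no_t2dm_probability(fhd=False, tg=150))
print(no_t2dm_probability(fhd=True, age=45, sbp=120))
print(no_t2dm_probability(fhd=True, age=60, sbp=150))
```

Each path from the root to a leaf yields one such rule, so a complete rule set would contain as many rules as the tree has leaf nodes.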
In this study, comparison of the decision tree and random forest models showed that the sensitivity and specificity of random forest were higher than those of the decision tree, which was inconsistent with previous studies 31,39. In contrast, the sensitivity of the C4.5 algorithm was reported to be higher than that of random forest, while the specificity of random forest was higher than that of the decision tree (C4.5) 39. The difference in sensitivity may be explained by the use of different algorithms.
The ROC curve is a technique for visualizing, organizing, and selecting classifiers based on their performance. The area under the curve (AUC) is an index of which model performs better, with a higher value indicating greater accuracy. This index, which summarizes the trade-off between the true positive and false positive rates across the range of decision cutoffs, is often used to evaluate the predictive accuracy of classification models 40.
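The construction of the ROC curve and its area can be sketched as follows. The predicted scores and labels are hypothetical, and the trapezoidal rule used here is one common way to compute the area, not necessarily the one used by the software in this study.

```python
def roc_points(scores, labels):
    """(1 - specificity, sensitivity) pairs, one per cutoff, from strict to lenient.

    A subject is classified as positive when its score is >= the cutoff.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(s >= cutoff and y == 1 for s, y in zip(scores, labels))
        fp = sum(s >= cutoff and y == 0 for s, y in zip(scores, labels))
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC points by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical predicted probabilities of T2DM and observed status (1 = T2DM).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
print(points)
print(auc(points))
```

In this toy example 8 of the 9 possible (diabetic, non-diabetic) pairs are ranked correctly by the scores, and the trapezoidal area equals the same fraction, 8/9.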
In the current study, the AUC of the random forest on the testing dataset was significantly higher than that of the decision tree, which was consistent with previous studies 31,39. The random forest model is an accurate model for the investigation of novel predictor markers, which is in line with previous studies 14,31.
The strength of the study lies in its large sample size, which makes it applicable to the general population. One potential limitation of this study is that it is based on cross-sectional data and therefore cannot provide the kind of results obtained from longitudinal or cohort data.

Conclusions
Random forest models can provide good prediction models owing to their efficacy, sensitivity, and specificity. According to the random forest model, TG and hs-CRP are the most important risk factors associated with T2DM. This study has also identified some new risk factors associated with T2DM, indicating the need for further evaluation of the clinical applicability of this model.